Use of a Drosophila Genome-Wide Conserved 
Sequence Database to Identify Functionally 
Related cis-Regulatory Enhancers 
Jermaine Ross, 1 Tzu-Yang Lin, 3 Chi-Hon Lee, 3 Takeshi Awasaki, 4 Tzumin Lee, 4 
and Ward F. Odenwald 1 * 



Background: Phylogenetic footprinting has revealed that cis-regulatory enhancers consist of conserved 
DNA sequence clusters (CSCs). Currently, there is no systematic approach for enhancer discovery and
analysis that takes full advantage of the sequence information within enhancer CSCs. Results: We have
generated a Drosophila genome-wide database of conserved DNA consisting of >100,000 CSCs derived from 
EvoPrints spanning over 90% of the genome. cis-Decoder database search and alignment algorithms enable 
the discovery of functionally related enhancers. The program first identifies conserved repeat elements 
within an input enhancer and then searches the database for CSCs that score highly against the input CSC. 
Scoring is based on shared repeats as well as uniquely shared matches, and includes measures of the bal- 
ance of shared elements, a diagnostic that has proven to be useful in predicting cis-regulatory function. To 
demonstrate the utility of these tools, a temporally-restricted CNS neuroblast enhancer was used to identify 
other functionally related enhancers and analyze their structural organization. Conclusions: cis-Decoder 
reveals that co-regulating enhancers consist of combinations of overlapping shared sequence elements, pro- 
viding insights into the mode of integration of multiple regulating transcription factors. The database and 
accompanying algorithms should prove useful in the discovery and analysis of enhancers involved in any 
developmental process. Developmental Dynamics 241:169-189, 2012. © 2011 Wiley Periodicals, Inc. 
Key findings: 

• A genome-wide catalog of Drosophila conserved DNA sequence clusters. 

• cis-Decoder discovers functionally related enhancers. 

• Functionally related enhancers share balanced sequence element copy numbers. 

• Many enhancers function during multiple phases of development. 
INTRODUCTION 

Understanding the mechanisms of 
dynamic gene expression remains a 
major goal of developmental biology. 



Previous studies have shown that 
many of the different spatial-temporal 
aspects of gene regulation are
controlled by multiple, functionally
independent cis-regulatory modules or
enhancers (reviewed by Bulger and 
Groudine, 2011). These studies have 
also identified several key characteris- 
tics of enhancers including their abil- 
ity to act at some distance from the 
genes that they regulate, their
positional independence relative to
transcription direction of the regulated 
gene, and their ability to function from 
within transcribed sequences (reviewed 
by Davidson, 2001). Functional analy- 
sis of in vivo characterized enhancers 
has also revealed that they typically 
span 300 to 2,000 bp and contain clus- 
ters of DNA-binding sites for sequence- 
specific DNA-binding transcription fac- 
tors (reviewed by Alonso et al., 2009). 
More recent studies indicate that some 
enhancers are regulated by chromatin 
DNA modifications and/or alterations 
in higher-order chromatin structure 
(reviewed by Suganuma and Workman, 
2011). 

The availability of genomic sequen- 
ces from evolutionarily related spe- 
cies allows for the comparison of 
orthologous DNAs, via phylogenetic 
footprinting, to identify functionally 
important conserved sequences within 
enhancers (reviewed by Visel et al., 
2007; King et al., 2007; Meireles-Filho 
and Stark, 2009; Alonso et al., 2009). 
The complexity of conserved enhancer
sequences suggests that enhancers
integrate multiple regulatory inputs via
different sequence-specific DNA-binding
factors (Kuo et al., 1998; Berman
et al., 2004; Brody et al., 2007). One
of the hallmarks of developmental
enhancers is the presence of repeated
DNA-binding sites for essential
transcription factors (Small et al., 1992;
Davidson, 1999; Berman et al., 2002,
2004; Gaul, 2010). For example,
multiple conserved DNA-binding sites for
Hunchback have been identified within
Drosophila segmentation enhancers
(Papatsenko et al., 2009), multiple
bHLH DNA-binding sites are found
within neural precursor cell enhancers
(Brody et al., 2007; Kuzin et al., 2009),
and multiple sites are likewise found in
Runt-, Ets-, and Smad-responsive
enhancers in mammals (Bowers et al.,
2010; Babayeva et al., 2010; Nakahiro
et al., 2010). 
Studies have also shown that altering 
the copy number of transcription fac- 
tor docking sites by adding or deleting 
multi-copy sequence motifs can alter 
enhancer behavior. This suggests that 
such repeat motifs are not necessarily 
redundant but each conserved copy 
may have an integral role in enhancer 
function (Kuzin et al., 2011). In addi- 
tion, studies on sequentially arrayed 
or clustered Drosophila enhancers 



have shown that individual enhancers 
are flanked by sequences referred to 
as spacers (Small et al., 1993). Com- 
parative genome analysis of spacer 
regions, termed here inter-clustal 
regions (ICRs), reveals that they ex- 
hibit a higher level of interspecies 
sequence length variability than do 
the less-conserved sequences within 
enhancer-conserved sequence clusters 
(CSCs) (Kuzin et al., 2009), thus pro- 
viding a useful method for delimiting 
the boundaries of enhancers. 

Our previous work has described 
EvoPrinter, a phylogenetic footprint- 
ing tool for discovering conserved 
sequences that are shared among 
orthologous DNAs (Odenwald et al., 
2005; Yavatkar et al., 2008). The out- 
put of EvoPrinter, an evolutionary 
gene print or EvoPrint, portrays in a 
single readout the conserved DNA 
within a species of interest, thus high- 
lighting conservation in a continuous 
gap-free sequence that facilitates the 
further comparative analysis of 
enhancer sub-structural organization 
as well as the discovery of novel 
enhancers (see below). We have also 
developed a set of integrated alignment 
algorithms, collectively known as cis- 
Decoder, that identify multi-copy and 
unique elements within CSCs that are 
shared with other CSCs (Brody et al., 
2007, 2008). 

To increase our understanding of 
enhancer sub-structure and to iden- 
tify families of functionally related 
enhancers via comparative analysis, we 
constructed a web-accessible genome- 
wide database of Drosophila CSCs that 
includes, in addition, CSCs within most 
in vivo characterized enhancers. Also 
described are additional cis-Decoder 
search algorithms that facilitate the dis- 
covery of database CSCs related to any 
input enhancer. Once the user inputs 
an EvoPrinted enhancer, cis-Decoder 
algorithms scan the database to detect 
structurally related CSCs using a 
three-step protocol: the initial search 
identifies database CSCs that share 
conserved multi-copy elements with the 
input sequence; the program then iden- 
tifies unique elements shared between 
the input enhancer and database CSCs; 
and finally the copy number of shared 
elements is evaluated to generate 
ranked similarity scores that relate the 
input enhancer to the database CSCs. 
To demonstrate the efficacy of this 



approach, which makes no assumptions 
about the function of individual 
sequence elements, we have utilized an 
enhancer of castor (cas), a late temporal 
neuroblast (NB) determinant (Mellerick 
et al., 1992; Cui and Doe, 1992; Kamba- 
dur et al., 1998), to identify previously 
uncharacterized late NB enhancers. We 
also show how cis-Decoder searches can
identify multiple previously
characterized cellular gap enhancers based on 
their shared sequence motifs and also 
identify shared overlapping transcrip- 
tion factor-binding sites. 

Our comparative analysis of 
enhancers also reveals that there is 
no single combination of DNA-binding 
sites of known regulators or novel 
conserved sequence elements that can 
accurately predict enhancer regula- 
tory behavior. However, enhancers 
that have a balance in copy number of 
shared sequence elements are more 
likely to exhibit similar regulatory 
activities. Although enhancers with 
similar regulatory behaviors share 
both multi-copy sequence motifs and 
unique conserved sequence elements 
that are balanced in copy number, 
arrangement of these shared ele- 
ments differs between enhancers. Our 
studies also demonstrate that many 
enhancers are multifunctional; they 
regulate gene expression during dif- 
ferent temporal phases of develop- 
ment. No other comparative align- 
ment program allows for the user to 
generate an inventory of conserved 
repeat and unique sequences that are 
shared between CSCs, an essential 
step in the analysis of their structure. 
Since the database includes most of 
the genomic repertoire of CSCs, these 
tools should serve to help in the fur- 
ther analysis of other novel function- 
ally important sequences and in the 
discovery of enhancers that drive gene 
expression during any developmental 
process or biological event. To our 
knowledge, this is the first systematic 
catalog of conserved DNA sequences 
within any phylogenetic group. 



RESULTS AND DISCUSSION 

Generation of a Genome-Wide 
CSC Database 

DNA sequence conservation histo- 
grams of the Drosophila genome 
reveal that its non-coding DNA is 
made up of CSCs that are flanked by 
less-conserved ICR DNA (Karolchik 
et al., 2007). For example, a conserva- 
tion histogram of the Drosophila mel- 
anogaster vvl gene transcribed region 
and 60 kb of 3' flanking DNA (located 
on the 3L chromosome) identifies 
multiple peaks of conserved DNA that 
are flanked by less conserved DNA 
sequences (Fig. 1A). EvoPrint analy- 
sis reveals that the CSCs can be fur- 
ther resolved into multiple smaller 
conserved sequence blocks (CSBs)
(Fig. 1B). Most regions of
chromosomes 2 and 3 gave a similar pattern
of CSC density and distribution, while
in general CSCs on the X and the 4th
chromosomes exhibited less
conservation among the twelve species.
cis-Decoder alignment of the CSBs
constituting a CSC identifies both repeat and
palindromic sequence (RPS) elements
of ≥6 bp in length, and reveals that
these account for more than half of the
CSC's conserved sequences (Fig. 1B).
The 6.4-kb genomic region shown in
Figure 1B was selected because two of
its CSCs (vvl-41 and vvl-43) were tested
for their regulatory behavior in this 
study (see below). Our previous analy- 
sis of enhancer sequence conservation 
has shown that individual enhancers 
can be identified by the maintenance of 
their CSB cluster integrity across Dro- 
sophila species, while ICR regions 
show greater sequence length variabili- 
ty (Kuzin et al., 2009). 

As a first step in the identification 
of structurally related CSCs, a ge- 
nome-wide database of Drosophila 
CSCs was created by EvoPrinting 
most of the euchromatic genome of 
Drosophila melanogaster and nearly 
all of the previously in vivo character- 
ized enhancers that are included in 
the REDfly database (Gallo et al., 
2006). Database CSCs were extracted 
from more than 4,000 author-gener- 
ated EvoPrints that generally spanned 
15-30 kb of genomic DNA. EvoPrints 
of fewer bases were used depending on 
genomic context and availability of 
gap-free sequence data in the ortholo- 
gous regions of the different species. 
Most EvoPrints included all of the
available melanogaster group
drosophilids (D. melanogaster, D. simulans,
D. sechellia, D. yakuba, D. erecta, and
D. ananassae), one of the obscura
group (D. pseudoobscura or
D. persimilis), and two to four orthologous
regions selected from the more
evolutionarily distant species D. willistoni,
D. virilis, D. mojavensis, and/or
D. grimshawi. Most of the 
EvoPrints represented a combined 
evolutionary divergence of >150 My 
(Tamura et al., 2004). Under these 
conditions, open reading frames that 
encode conserved protein domains do 
not show conservation in most of the 
codon wobble positions, indicating 
that the additive evolutionary diver- 
gence represented in each EvoPrint is 
sufficient to reveal with near base- 
pair resolution those sequences that 
are essential for gene function (Oden- 
wald et al., 2005). EvoPrints of open 
reading frames, using different combi- 
nations of species, reveal that the lack 
of sequence conservation in the amino 
acid codon wobble position is not the 
result of different codon preferences 
between species (data not shown). 

To enhance the detection of con- 
served DNA and avoid alignment 
inaccuracies triggered by DNA 
sequencing errors, sequencing gaps, 
rearrangements, or genome assembly 
problems that were unique to any one 
of the species used in the analysis, we 
employed relaxed EvoPrint readouts 
to identify CSCs. A relaxed EvoPrint 
highlights sequences that are present 
in all or all but one of the orthologous 
DNAs used to generate the print 
(Yavatkar et al., 2008). Species with 
sequencing gaps (identified as blocks 
of species-specific differences in the 
color-coded relaxed EvoPrint readouts 
or identified as gaps in the EvoPrinter 
scorecard) were avoided in generating 
EvoPrints, and second and third scor- 
ing pair-wise alignments were 
included in the analysis when rear- 
rangements were detected (Yavatkar 
et al., 2008). 
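
The relaxed criterion described above can be illustrated with a minimal sketch. It assumes that each pairwise alignment has already been reduced to one list of booleans per comparison species (True where that species matches the reference base); the function name and data layout are hypothetical and are not taken from EvoPrinter itself.

```python
def relaxed_evoprint(reference, conserved_by_species, max_missing=1):
    """Return the reference with conserved bases in upper case, others in lower case.

    reference: reference-species sequence (any case).
    conserved_by_species: one list of booleans per comparison species,
        True where that species' alignment matches the reference base.
    max_missing: number of species allowed to disagree
        (0 = strict readout, 1 = "relaxed" readout as described in the text).
    """
    n_species = len(conserved_by_species)
    out = []
    for i, base in enumerate(reference):
        # Count the species in which this reference base is conserved.
        hits = sum(1 for flags in conserved_by_species if flags[i])
        if hits >= n_species - max_missing:
            out.append(base.upper())
        else:
            out.append(base.lower())
    return "".join(out)
```

With max_missing=0 the same function gives a strict readout in which a base is capitalized only when every species agrees.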

To catalogue CSCs, EvoPrints were 
entered into the EvoPrint CSC cutter 
algorithm to isolate and annotate 
individual CSCs separated by at least 
150 bp of less-conserved DNA. This 
program also assigns a file name and 
consecutive numbers to each CSC in 
an EvoPrint. To ensure that
enhancers containing CSB separation
gaps of 150 bases or more were
not truncated, CSCs were also parsed
independently two additional times
using ICR cutoffs of 200 and 250 bp. 
Duplicates are given the same name 
but an additional notation to distin- 



guish them. Therefore, clusters that 
were parsed multiple times (~20% of 
the database CSCs), due to their hav- 
ing non-conserved intervals >150 or 
>200 but <250 bases, are present two 
or three times in the database. The 
database contains > 100,000 non- 
redundant clusters. To expedite data- 
base searches, in addition to catalog- 
ing individual CSCs and their CSBs, 
RPS elements of 6 bp or longer were 
pre-identified by intra-CSC CSB 
alignments and stored in the database.
Most CSCs that contain more
than 150 bp of conserved DNA have
RPS elements that account for >50%
of their sequences (for example, see
Fig. 1B; see also Fig. 2B). 
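
The parsing step can be sketched as follows. This is only an illustration under stated assumptions, not the CSC cutter itself: it assumes an EvoPrint is a string in which conserved bases are upper case, that a CSB is an upper-case run of at least 6 bases, and that the ICR gap is measured between the end of one CSB and the start of the next. The function names are hypothetical.

```python
import re


def find_csbs(evoprint, min_len=6):
    """Return (start, end) spans of upper-case runs of at least min_len bases."""
    pattern = r"[ACGT]{%d,}" % min_len
    return [m.span() for m in re.finditer(pattern, evoprint)]


def cut_cscs(evoprint, icr_cutoff=150, min_len=6):
    """Group CSBs into clusters separated by at least icr_cutoff bases."""
    clusters = []
    for start, end in find_csbs(evoprint, min_len):
        # Distance from the end of the previous CSB to the start of this one.
        if clusters and start - clusters[-1][-1][1] < icr_cutoff:
            clusters[-1].append((start, end))
        else:
            clusters.append([(start, end)])
    return clusters
```

Calling cut_cscs three times with icr_cutoff set to 150, 200, and 250 mirrors the repeated parsing described above: a 200-bp gap splits two clusters at the first two cutoffs but merges them at the third.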

The CSC database contains two 
types of file entries: (1) ~2,000 files 
originated from genomic regions span- 
ning previously characterized genes 
and (2) ~1,000 entries consisted of 
genomic regions that cover more than 
one known or predicted gene or large 
regions of CSCs not associated with 
flanking genes. Genomic regions that 
contain highly repetitive DNA sequen- 
ces that lack identifiable sequence con- 
servation, such as most of chromosome 
4 and specific regions of the X, were 
not included in the database. Care was 
taken when annotating the clusters to 
identify CSCs within non-coding, cod- 
ing, and 3'UTR regions. Database 
searches can be modified to include all 
CSCs or focus on just coding or non- 
coding regions. It is important to note 
that CSCs were named according to 
their proximity to genes: whether the 
CSCs are indeed enhancers for nearby 
genes requires functional tests and 
knowledge of endogenous gene 
expression patterns. To allow the user 
to find the location of the CSC relative 
to flanking genes, a link is provided to 
the UCSC BLAT server (located on 
the one-on-one alignment results 
web-page; see the online cis-Decoder 
tutorial). The UCSC browser also pro- 
vides information concerning chroma- 
tin accessibility and transcription fac- 
tor-binding data. 

We have also included CSCs from 
all previously in vivo characterized 
enhancers by EvoPrinting all entries 
in the REDfly database (Gallo et al., 
2006); these are identified in the CSC- 
database by their REDfly designa- 
tions. Although most of these CSCs 
duplicate database entries, CSCs that 
[Fig. 1B EvoPrint readout (CSC labels vvl-41 and vvl-43); sequence text of the image is not recoverable.]


Fig. 1. The Drosophila genome can be parsed into clusters of conserved sequence blocks that are flanked by less conserved DNA. A: Shown is 
a UCSC Genome Browser conservation histogram of a D. melanogaster chromosome 3L region that spans 66 kb of the vvl transcribed sequence 
and 3' flanking DNA. Highly conserved DNA sequences that align with the orthologous regions of other Drosophilids are indicated as peaks in the 
histogram. The rectangle identified as "Your Seq" corresponds to the EvoPrinted region shown in B and the 4 vertical red-colored arrows corre- 
spond to the CSC parsing boundaries shown in B. B: A D. melanogaster (ref. sequence) relaxed 12 species EvoPrint of the Your Seq region in A 
(6,355 bp) identifies three conserved sequence clusters designated vvl-41, -42, and -43. Capital letters represent conserved bases in the
D. melanogaster sequence that are present in all, or in all but one, of the orthologous regions within 11 additional species. Intra-cluster cis-Decoder CSB
alignments reveal that over 60% of the conserved sequences within each CSC span ≥6-bp repeat sequence elements (yellow highlight) that are
either separate, adjacent, and/or overlapping each other. High copy number RPS elements within each CSC are noted with different colored
highlights. Red-colored arrows indicate parsing boundaries for the cis-Decoder CSC database. 
TABLE 1. cas-6 CSC Database Search Results Showing Tested Clusters^a

Cluster      Correlation   Shared    Total shared   Percent    Required   Longest    Conserved
name         coefficient   repeats   elements       coverage   elements   sequence   bases
cas-6        1.00          53        97             100        5          36         554
cg7229-5     0.63          20        44             56.88      4          11         320
vvl-14       0.56          26        65             58.63      3          11         498
nab-1        0.55          18        54             46.21      4          11         415
cg6559-28    0.52          29        58             54.38      3          11         521
cas-8        0.50          65        147            72.97      4          36         1,021
tkr-15       0.46          21        42             57.74      3          10         265
grh-15       0.44          41        96             64.98      5          9          554
vvl-43       0.42          42        93             60.03      3          11         633

^a See Figures 2-7 for in vivo cis-regulatory activity. 



represent the same region can be 
identified by their similar cis-Decoder 
scores (see below) and/or their similar 
identifying names. It should be noted 
that many REDfly entries were made 
from data that often did not delimit 
the exact boundaries of the enhancer. 
In addition many REDfly entries 
included multiple CSCs or truncated 
CSCs whose ends were restriction 
enzyme sites used for cloning pur- 
poses and were not within less-con- 
served ICRs. To reduce the number of 
truncated entries, EvoPrinted 
regions were expanded to include 
flanking ICRs. Also, since many 
REDfly entries are redundant, care 
was taken to eliminate this redun- 
dancy by eliminating repeated and 
overlapping entries. 



Identifying Enhancers With 
Similar Regulatory Behaviors 

In addition to the comparative analy- 
sis of enhancer sub-structure, our 
goal in establishing the CSC database 
and accompanying search algorithms 
was to identify functionally related 
enhancers. The assumption that initi- 
ated this study is that many function- 
ally related enhancers share overlap- 
ping sets of conserved sequence 
elements. To demonstrate the utility of 
cis-Decoder search algorithms in iden- 
tifying related tissue- and/or temporal- 
specific enhancers, we show how a sin- 
gle enhancer can be used to identify 
other functionally related enhancers. A 
detailed step-by-step tutorial describ- 
ing the use of the search protocol is 
given at the cis-Decoder website
(http://cisdecoder.ninds.nih.gov/pages/tutorial/index.html). 

CSC Database Search 
Protocol 

The first step in a CSC database
search is to enter into the cis-Decoder
input window an EvoPrinted enhancer
that spans a single CSC. cis-Decoder
then parses and annotates constituent
CSBs in forward and
reverse/complement directions. By alignment of the
CSBs to one another, the program
next identifies multi-copy and
palindromic elements that are ≥6 bp. A
table is generated that shows the
copy number of each repeat, the
element frequency in the database, and
the number of database CSCs that
contain two or more of each element. 
Based on our earlier analysis of known 
enhancers, matches of less than 6 bp 
in length were not considered, because 
searches with 5 bases or less yielded 
results that were not informative 
(Brody et al., 2007, 2008; and data not 
shown). 
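
The self-alignment step can be approximated with a simple k-mer count. The sketch below is a simplification under stated assumptions, not the cis-Decoder algorithm: it inspects a single fixed length k, whereas the program reports elements of 6 bp or longer, and it counts each k-mer together with its reverse complement. All names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def rps_elements(csbs, k=6):
    """Return k-mers that are repeated or palindromic within a set of CSBs.

    Each k-mer is counted under a strand-independent key: the
    lexicographically smaller of the k-mer and its reverse complement.
    """
    counts = Counter()
    for csb in csbs:
        for i in range(len(csb) - k + 1):
            kmer = csb[i:i + k]
            counts[min(kmer, revcomp(kmer))] += 1
    # Keep multi-copy elements and single-copy palindromes.
    return {
        kmer: n for kmer, n in counts.items()
        if n >= 2 or kmer == revcomp(kmer)
    }
```

For the toy CSBs TTATGCAAAT, GGATGCAAACC, and CAATTGA, a search with k=7 returns only the heptamer ATGCAAA (two copies), and a search with k=6 also reports the single-copy palindromic E-box CAATTG.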

After identifying RPS elements, the 
cis-Decoder algorithm searches the 
CSC database to discover CSCs
containing these repeats. The search
algorithm also allows for user-supplied
mandatory sequences, to identify
enhancers that are regulated 
by sequence-specific DNA-binding fac- 
tors or families of transcription fac- 
tors. Once database CSCs are identi- 
fied, the program carries out 
individual CSB alignments between 
the input CSC and the database CSCs 
(see below). Another set of algorithms 
then rates the individual database 



CSCs using the following similarity 
indices when compared to the input 
CSC: (1) A repeat balance profile, that 
assesses relative shared repeat copy 
numbers and weighs them according 
to the RPS length (shown as a pie 
chart and as a repeat balance map, 
which are accessible from the one-on- 
one alignment page; for examples see 
Figs. 3C, 4, 6B, 7A); (2) A correlation 
coefficient, which reflects the relative 
frequency of shared sequence ele- 
ments between the input and data- 
base CSCs; (3) The number of shared 
repeats (full-length RPS elements and 
shorter elements contained within 
longer input repeats); (4) Total num- 
ber of shared elements including RPS 
and uniquely shared sequences; (5) 
Percent coverage of aligning input 
sequences, which reflects the number 
of conserved bases in the database 
CSC that align with the input 
enhancer CSBs, normalized to the 
total number of conserved sequences 
in the database cluster; (6) The num- 
ber of user-specified required elements 
present in the database CSC; (7) The 
longest shared sequence between the 
input and database CSCs (viewed at 
the cis-Decoder scorecard by placing 
the cursor on the sequence length 
number); and (8) The total number of 
conserved bases within the database 
CSC (see Table 1). To allow the user to 
focus attention on any one of the rat- 
ing criteria, the CSCs can be sorted by 
any of the similarity indices in addi- 
tion to sorting by CSC file name. Sort- 
ing by file name allows for the rapid 
identification of closely associated, 
neighboring CSCs that are structur- 
ally related to the input enhancer. 
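
Two of these operations are easy to illustrate: the percent coverage index (5) and the re-sorting of the scorecard by any one index. The sketch below is a hedged illustration; the field names are hypothetical, the rows are copied from Table 1, and the formula for the correlation coefficient is not reproduced because it is not specified in this section.

```python
def percent_coverage(aligned_conserved_bases, total_conserved_bases):
    """Index 5: aligned conserved bases as a percentage of all conserved bases
    in the database cluster."""
    return 100.0 * aligned_conserved_bases / total_conserved_bases


def sort_scorecard(rows, index, descending=True):
    """Return scorecard rows ordered by one similarity index (or by name)."""
    return sorted(rows, key=lambda row: row[index], reverse=descending)


# A few rows copied from Table 1 (field names are hypothetical).
scorecard = [
    {"name": "cg7229-5", "correlation": 0.63, "shared_repeats": 20, "coverage": 56.88},
    {"name": "vvl-14", "correlation": 0.56, "shared_repeats": 26, "coverage": 58.63},
    {"name": "cas-8", "correlation": 0.50, "shared_repeats": 65, "coverage": 72.97},
    {"name": "grh-15", "correlation": 0.44, "shared_repeats": 41, "coverage": 64.98},
]

by_correlation = [r["name"] for r in sort_scorecard(scorecard, "correlation")]
by_repeats = [r["name"] for r in sort_scorecard(scorecard, "shared_repeats")]
by_name = [r["name"] for r in sort_scorecard(scorecard, "name", descending=False)]
```

Sorting these rows by correlation coefficient places cg7229-5 first, whereas sorting by shared repeats places cas-8 first, which shows why the choice of index matters when candidates are selected.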
cis-Decoder Analysis of a 
castor Late NB Enhancer 

To demonstrate the utility of cis- 
Decoder database search algorithms to 
identify tissue- and temporal-specific 
enhancers, we have used one of the 
late-temporal network NB enhancers 
(database CSC cas-6) that controls 
the embryonic expression of the gene 
encoding Cas, a zinc-finger transcrip- 
tion factor expressed during late em- 
bryonic CNS NB lineage development 
(Mellerick et al., 1992; Cui and Doe, 
1992; Kambadur et al., 1998). Like 
endogenous cas mRNA expression, 
the cas-6 enhancer activates reporter 
transgene expression in CNS NBs 
and ventral cord midline cells during 
embryonic stage 10 and in additional 
ventral cord and cephalic lobe NBs 
during stages 11-13 (Fig. 2A and 
Table 2; for cas mRNA and protein 
expression details see Mellerick et al., 
1992; Kambadur et al., 1998). Evo- 
Print analysis reveals that the cas-6 
CSC is made up of 46 CSBs of 6 bp or 
more and contains 720 conserved base 
pairs in 1,613 bp of genomic sequence 
(Fig. 2B). Mutational analysis of the 
cas-6 CSC via 5' and 3' deletions 
revealed that the entire cluster was 
required for full reporter activity (A.
Kuzin, unpublished results). The
cas-6 CSC is located 392 bp 5' to 
the cas gene predicted transcriptional 
start site. As described above, one of 
the first steps in the cis-Decoder anal- 
ysis is parsing CSBs from the input 
EvoPrinted enhancer in both forward 
and reverse directions, and then 
aligning the CSBs with one another 
(self-alignment) to discover RPS ele- 
ments (Fig. 2C). More than 65% of the 
conserved bases in the cas-6 CSBs 
were represented in RPS elements; 
an alignment revealed that these are 
either separate, adjacent, or overlap- 
ping each other (yellow-colored high- 
lights in Fig. 2B). Core DNA-binding 
motifs for known transcription factors 
within CSBs are indicated in Figure 
2B and C. 

Prominent among the cas-6 RPS 
elements are three 10-mer repeat
motifs [TTATGCAAAT], which
contain a POU-homeodomain-octamer-binding
site [ATGCAAAT] (Herr and
Cleary, 1995). The highest copy
number element [ATGCAAA], containing
7 of the 8 bases of the octamer motif,
was found 5 times (green underlined 
in Fig. 2C). It is considered a sub- 
repeat element, since there is only 
one instance of the heptamer in the 
CSBs that is independent of longer 
elements. Also present are multiple 
elements containing the core ATTA 
sequence for Antennapedia class 
homeodomain containing transcrip- 
tion factors (reviewed by Gehring 
et al., 1994). Also present in the RPS 
elements are two palindromic E-box 
sequences, CAATTG and CAGCTG 
(Murre et al., 1989), while three addi- 
tional E-boxes are present in con- 
served non-repeated sequences. The 
cas-6 enhancer CSBs also contain 
Hunchback and Cas core DNA-bind- 
ing sequences (Fig. 2; Kambadur 
et al., 1998). Given that many of the 
cas-6 RPS elements are novel 
sequences, they most likely contain 
additional binding sites for as yet 
uncharacterized transcription factors 
that modulate enhancer regulatory 
behavior. 

Searching for cas-6 Related 
NB Enhancers 

To identify database CSCs that share 
repeat and unique elements with the 
cas-6 CSC, we initiated a search by 
first identifying CSCs that contained 
at least three copies of the ATGCAAA 
element. Although asking for a man- 
datory sequence is not required, the 
cas-6 RPS table revealed that the 
highest copy number element,
ATGCAAA, was present 7,208 times in the 
CSC database and 371 CSCs con- 
tained two or more of these elements. 
The cis-Decoder scorecard for this 
search revealed that the database 
contained 104 CSCs with 3 or more of 
this element (data not shown). Thus, 
we focused the search to this limited 
set of CSCs. Once these CSCs were 
identified, one-on-one alignments 
between the input and database CSBs 
were automatically performed to dis- 
cover additional shared sequence ele- 
ments. As expected, the highest scor- 
ing database CSC for most of the 
indices was cas-6 itself (Table 1). 
Other high-scoring enhancers were 
considered as candidate late temporal 
network NB enhancers and were 
tested in enhancer-reporter trans- 
genes (see below). For example, while 
cg7229-5 scored highest for the corre- 



lation coefficient, other CSCs scored 
higher for each of the other metrics. 
Table 1 contains only a fraction of the 
database clusters in the actual read- 
out (currently more than 100), since 
the database has been updated with 
additional CSCs after the initiation of 
the functional analysis of CSCs 
related to the cas-6 enhancer. 
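
The mandatory-sequence filter used in this search can be sketched as follows. The sketch assumes that the database is a mapping from CSC name to a list of CSB strings and that copies are counted on both strands with overlaps allowed; both are assumptions made for illustration, and the cluster names in the usage note are invented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_occurrences(seq, element):
    """Count occurrences of element in seq, allowing overlaps."""
    count, start = 0, seq.find(element)
    while start != -1:
        count += 1
        start = seq.find(element, start + 1)
    return count


def copy_number(csbs, element):
    """Copies of element, on either strand, across the CSBs of one CSC."""
    rc = revcomp(element)
    total = 0
    for csb in csbs:
        total += count_occurrences(csb, element)
        if rc != element:  # do not count a palindrome twice
            total += count_occurrences(csb, rc)
    return total


def filter_database(database, element, min_copies=3):
    """Names of CSCs that carry at least min_copies of the required element."""
    return [
        name for name, csbs in database.items()
        if copy_number(csbs, element) >= min_copies
    ]
```

For a toy database in which cluster toy-1 holds the CSBs TTATGCAAAT, ATGCAAAGG, and CCTTTGCATCC and cluster toy-2 holds ATGCAAA and GGGGGGG, requiring three copies of ATGCAAA retains only toy-1, whose third copy lies on the opposite strand.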

Although the search required the 
heptamer sequence ATGCAAA to be 
present at least three times in the 
database CSC, most of the highest- 
scoring CSCs (both for correlation 
coefficients and shared RPS elements) 
contained at least three RPS elements 
with the full octamer motif [ATGCAAAT],
including cg7229-5, grh-15,
vvl-41, and tkr-15 (Figs. 3B, 4B; data 
not shown). In addition, many of the 
CSCs that contained octamer motifs 
also shared, with cas-6, single or dif- 
ferent combinations of bHLH E-box 
DNA-binding sites and repeated 
HOX-binding sites, including shared 
sequences flanking the core ATTA 
motif. An example of the one-on-one 
CSB alignment between cas-6 and 
cg7229-5 CSBs, discovered in this 
search, is shown in Figure 3A. Align- 
ing cas-6 CSB sequences are color- 
coded to represent cas-6 RPS ele- 
ments (red), truncated portions of the 
cas-6 repeat sequences that we term 
sub-repeats (orange), and >6-bp 
sequences that are unique matches 
between cas-6 and cg7229-5 (blue). In 
many cases, different multi-copy 
repeats are nested within larger 
unique matches. For example, within 
the largest unique aligning sequence 
shown in Figure 3A, RPS elements 
corresponding to a HOX site overlap a 
POU-octamer site. We believe that 
this view of overlapping shared motifs 
represents a map of the substructure 
of an enhancer in terms of the tran- 
scription factor-binding sites that 
integrate multiple regulatory inputs. 

cis-Decoder also generates lists of
sequence elements that are shared
between the input and database CSC. 
For example, Figure 3B shows the 
complete output of repeat, sub-repeat, 
and unique matches between the cas-6 
and cg7229-5 CSCs. Fifty-seven per- 
cent of the cg7229-5 conserved sequen- 
ces aligned with cas-6 conserved 
sequences (Table 1 and Fig. 3C). In 
addition, cis-Decoder also identifies 
RPS elements within the input and 
Fig. 2. The cas-6 CSC functions as an NB enhancer that regulates gene expression during late embryonic CNS sub-lineage development.
A: cas-6 CSC enhancer-reporter transgene activates expression in a subset of NBs during late sub-lineage development. Shown are dissected
fillets of whole-mount stained embryos, stages 10 through 13 (s10-s13; anterior up). B: An EvoPrint of the cas-6 enhancer (same EvoPrint
conditions as in Fig. 1B). CSB sequences that span repeat elements are highlighted in yellow (identified from cis-Decoder CSB alignments, see C).
Colored underlined bases correspond to the core transcription factor DNA-binding sites (homeodomain, ATTA-red; POU domain, ATGCAAAT-green;
bHLH, CANNTG-brown; Hunchback/Castor, TTTTT/AT-blue; Tramtrack, TCCT-gold; and PBX sites, TGAT-teal). C: cis-Decoder self-alignment of the
cas-6 enhancer CSC identified 50 distinct repeat or palindromic elements. The total element count in the table refers to the number of times a
repeat appears in the CSC database. Colored asterisks indicate repeats that contain core known transcription factor DNA-binding motifs
highlighted in B. The green-colored underlined repeat indicates the sequence (ATGCAAA) that was used to identify other late sub-lineage NB
enhancers that share sequence elements with cas-6 (see Figs. 3-5).
TABLE 2. Location, Structure, and Expression Dynamics of CSC Transgenes^a

| CSC name | Chromosome | CSC length (bp) | Conserved bases (bp) | Transgene expression: Embryo | Transgene expression: Larva | Transgene expression: Adult | Figures |
|---|---|---|---|---|---|---|---|
| cas-6 | 3R | 2,242 | 651 | NBs | None detected | None detected | 2 |
| cg7229-5 | 2R | 849 | 367 | NBs | None detected | None detected | 3 |
| vvl-14 | 3L | 1,128 | 544 | NBs | Not done | MB | 5 and 9 |
| nab-1 | 3L | 1,012 | 369 | NBs | Subset CL and VC NBs | MB | 5 and 9 |
| cg6559-28 | 3L | 1,484 | 452 | NBs | Many CL & VC NBs | MB, TmY | 5 and 9 |
| cas-8 | 3R | 2,664 | 1,331 | NBs | Subset SOG neurons | None detected | 5 |
| tkr-15 | 2R | 1,091 | 337 | GMCs | Subset CL and VC NBs | None detected | 5 and 9 |
| grh-15 | 2R | 1,376 | 621 | NBs | Subset CL and VC NBs | None detected | 5 and 9 |
| vvl-43 | 3L | 1,934 | 738 | Ectoderm | Subset CL and VC neurons | SOG and optic lobe neurons | 5 and 6 |
| sqz-11 | 3R | 1,082 | 427 | NBs | Subset CL and VC NBs | None detected | 5 |
| ct-14 | X | 1,146 | 321 | NBs | CL neurons and subset VC glia | None detected | 5 |
| ct-3 | X | 689 | 284 | NBs | CL neurons and subset VC glia | None detected | 5 |
| vvl-41 | 3L | 1,590 | 725 | NBs | Subset CL and VC neurons | Subset of SOG neurons | 5 and 7 |
| cg32264-76 | 3L | 833 | 263 | None detected | Not done | Subset of CL neurons | S3 |

^a Multiple, independent enhancer-reporter transgenes were tested for each CSC and all were integrated into the attP2 site on
chromosome 3L at 68A4 via the φC31 transgene integration method (Groth et al., 2004). NBs, neuroblasts; GMCs, ganglion
mother cells; CL, cephalic lobes; VC, ventral cord; SOG, sub-esophageal ganglion; MB, mushroom body neurons; TmY,
Transmedullary Y neurons in optic lobe.



database CSC that are not shared 
between the two CSCs, and these ele- 
ments are also listed on the one-on- 
one alignment page (data not shown). 
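As an illustration of how an alignment-coverage figure such as the 57% reported above might be computed, the following sketch marks every base of a database CSC that lies within a 6-mer also present, in either orientation, in the input CSBs. This is an approximation under stated assumptions: the 6-bp minimum is taken from the shared-element length mentioned in the text, and the function names are hypothetical rather than the cis-Decoder implementation.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def kmers(seq, k):
    """Set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def percent_coverage(input_csbs, db_csbs, k=6):
    """Percent of database CSB bases covered by k-mers shared with the input CSBs."""
    index = set()
    for block in input_csbs:
        block = block.upper()
        index |= kmers(block, k)
        index |= kmers(reverse_complement(block), k)  # allow reverse-orientation matches

    covered = 0
    total = 0
    for block in db_csbs:
        block = block.upper()
        mask = [False] * len(block)
        for i in range(len(block) - k + 1):
            if block[i:i + k] in index:
                for j in range(i, i + k):
                    mask[j] = True
        covered += sum(mask)
        total += len(block)
    return 100.0 * covered / total if total else 0.0


# Hypothetical CSBs, for illustration only.
example = percent_coverage(["ATGCAAATCC"], ["GGATGCAAAGG"])
```

Runs of consecutive shared 6-mers merge into longer covered stretches, which is how matches longer than the minimum length are represented in this sketch.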



Functionally Related NB 
Enhancers Share Balanced 
RPS Element Copy Numbers 

The relative frequency of appearance 
of sequences in cg7229-5 that corre- 
spond to cas-6 RPS elements is shown 
by color-coded highlights (Fig. 3C). 
We term this comparison a "repeat 
balance map," a visual representation 
that illustrates the relative frequency 
of appearance of each of the shared 
motifs in the comparison between the 
input and database enhancers. Forty- 
six percent of the aligning bases 
within the cg7229-5 CSC are present 
in the same ratio in the cas-6 CSC. 
The predominance of green and grey 
highlights indicates that many of the 
shared elements in the two enhancers 
are present at equal frequency. Another 
example of a CSC identified in this 
search that shares balanced RPS ele- 
ments with the input cas-6 is the grh- 
15 CSC (Fig. 4; Table 1), also a tempo- 
ral network NB enhancer (see below). 
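The color scheme of a repeat balance map can be sketched as a simple classification of each shared element by the difference between its copy number in the input and in the database CSC. The categories follow the legend of Fig. 4; the function names and the example copy numbers are hypothetical, and this element-level tally is a simplification of the base-weighted percentages reported in the text.

```python
from collections import Counter


def balance_color(input_copies, db_copies):
    """Map a shared element to its repeat-balance highlight color (per the Fig. 4 legend)."""
    if input_copies == 1:
        return "gray"      # present just once in the input CSBs
    diff = abs(input_copies - db_copies)
    if diff == 0:
        return "green"     # balanced copy number
    if diff == 1:
        return "yellow"    # unbalanced by one copy
    if diff == 2:
        return "purple"    # unbalanced by two copies
    return "red"           # unbalanced by three or more copies


def summarize(shared):
    """Tally highlight colors for a dict of element -> (input copies, database copies)."""
    return Counter(balance_color(a, b) for a, b in shared.values())


# Hypothetical copy numbers, for illustration only.
shared_elements = {
    "ATGCAAA": (3, 3),
    "CAGCTG": (2, 3),
    "TAATTA": (4, 2),
    "TTTTTT": (2, 6),
    "GGATCC": (1, 1),
}
tally = summarize(shared_elements)
```

A map dominated by the green and gray categories corresponds to the balanced case described for cas-6 and cg7229-5.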



To test the in vivo cis-regulatory 
activity of CSCs, we selected CSCs 
that contained both repeat and unique 
sequence elements found in the cas-6 
enhancer. The CSCs were selected 
based on rating criteria described 
above, as shown in Table 1. Enhancer- 
reporter transgene transformants for 
the individual CSCs were generated 
using the targeted φC31 integration
system to ensure that the regulatory 
behavior for each was assessed in 
the same genomic environment (see 
Experimental Procedures section and 
Supp. Fig. S4, which is available 
online). Although not an exact match, 
the expression pattern of the cg7229-5 
enhancer transgene shares many of 
the expression dynamics of the cas-6 
enhancer-transgene (Fig. 3D; Table 2). 
As with cas-6, onset of cg7229-5 
expression is in a subset of midline 
cells and a single lateral NB at stage 
10, and expression in subsequent 
stages closely matches, but is not identical
to, expression of the cas-6 reporter.
The inset shows that cg7229-5
reporter GFP expression overlaps but
is not identical to that of the cas-6 red
fluorescent protein reporter.

Many of the tested CSCs (Table 2 
and discussed below) yielded detecta- 



ble CNS expression and function as 
late temporal network CNS neuro- 
blast enhancers (Figs. 3, 5; Table 2; 
data not shown). Eleven were 
expressed in late temporal network 
ventral cord NBs and three were 
expressed in other CNS precursors or 
neurons (Figs. 3, 5). Comparing these 
expression patterns to the cas-6 re- 
porter expression (Fig. 2), it is appa- 
rent that each functions as a late tem- 
poral network enhancer. An indication 
of the specificity of the search for cas- 
6-like enhancers is that the search did 
not identify early temporal NB 
enhancers (Brody et al., 2008; Kuzin 
et al., 2009), nor did it identify broadly 
expressed NB enhancers such as that 
of deadpan (Emery and Bier, 1995). 

Although the cas-6-related enhancers 
are active in overlapping neural pre- 
cursor cells, each has its own unique 
cis-regulatory identity. Each has a different
pattern of expression in subsets
of NBs, GMCs, and/or nascent neu- 
rons. For example, three identified 
enhancers (nab-1, CG6559-28, and tkr-15)
exhibit early expression in a subset
of ventral cord midline cells, while sqz- 
11 and vvl-41 (identified using cas-8 as 
the input CSC) exhibit onset in a 
larger number of midline cells while 
[Fig. 3C panel: cg7229-5 enhancer showing a repeat balance map with cas-6 enhancer RPS elements]




Fig. 3. Late sub-lineage NB enhancers share conserved repeat elements that are balanced in their frequency of occurrence. A: A one-on-one
CSB cis-Decoder alignment of three consecutive cg7229-5 CSBs (nos. 2-4) with cas-6 CSB sequences. Color-coded bases: Green, the required
cas-6 repeat element used to identify other CSCs in the database search; Blue, sequences that are present just once in the cas-6 enhancer; Red,
cas-6 repeats; Orange, shorter (>6 bp) repeat sequences that are part of larger cas-6 repeats. The cas-6 CSB number and alignment orientation
(forward or reverse) is indicated following each aligning sequence. B: cas-6 and cg7229-5 share conserved elements that are unique (blue), repeat
(red), or sub-repeat (gold) elements within the cas-6 CSC (green underlined sequence indicates the mandatory element used to initiate the CSC
database search). C: A cg7229-5 CSC 12 species relaxed EvoPrint. Sequences that are present within cas-6 CSBs are highlighted in the
cg7229-5 CSBs and color-coded to indicate their relative frequencies (see Fig. 4 for color code). D: cg7229-5 CSC enhancer-reporter transgene
expression analysis (Gal4-reporter mRNA in situ hybridization) reveals that, similar to the cas-6 enhancer, the cg7229-5 CSC functions as a late temporal
window NB enhancer (embryo preparations as in Fig. 2A). Inset: Co-expression analysis reveals partial overlap between cells expressing cas
mRNA (green) and those expressing the cg7229-5 enhancer-reporter transgene mRNA (red; stage 11, dorsal whole-mount view).
[Fig. 4 panels: Repeat Ratio Distribution (pie chart); Repeat Balance Map for database CSC grh-15 when aligned with input cluster cas-6]



Fig. 4. cis-Decoder analysis reveals that cas-6 and grh-15 CSCs share many sequence elements that are balanced in their copy number. Shown
are a pie chart and a repeat balance map, both of which illustrate the relative copy number balance of shared elements between cas-6 and grh- 
15. The repeat balance map of a relaxed grh-15 CSC EvoPrint was highlighted to show comparative frequency of elements that are shared with 
the cas-6 CSC. Green indicates balanced repeat element numbers between the two CSCs; yellow highlights repeats that are unbalanced by just 
one copy; purple, two copies; and red, three or more copies. Gray highlighted sequences are present just once in the cas-6 CSBs. When uniquely 
shared sequences overlap repeat sequences, the repeat-ratio highlight color indicator is shown. When repeat elements overlap one another, the 
balance-ratio highlight of the longer repeat is shown, and when two repeats of equal size overlap, the more balanced repeat is highlighted. 



other enhancers do not activate reporter
expression in the midline precursor
cells (Fig. 5). The cas-8 CSC
activated reporter expression in many
more precursors at stage 11 than any
of the other reporter constructs. tkr-15
is expressed in many cells at stage 11.
Since these cells are too small to be 



considered NBs, they are most likely 
GMCs or nascent neurons. Comparing 
different transgene reporter expres- 
sion patterns in lateral ventral cord 
cells at stage 11 reveals that for cer- 
tain CSCs, in particular sqz-11, ct-3, 
[identified using the pdm-2 NB 
enhancer as input (see Fig. 5B)], fewer 



lateral cells express, or they exhibit 
uniquely different spatial expression 
patterns. This is also true for ct-14
(identified using combined cas-6 and
CG6559-28 as input) and vvl-41 (identified
using cas-8 as input). The cas-6 and cas-8
enhancers both drive reporter expression
in overlapping subsets of cells
Fig. 5. Identification of novel embryonic neural precursor cell enhancers based on their shared repeat sequences with other known neural
enhancers. A: Like the cg7229-5 enhancer (Fig. 3), additional database CSCs (Table 1) were identified that share balanced repeat sequences with
the cas-6 enhancer, and they also function as late NB sub-lineage enhancers. Many identified CSCs are adjacent to known NB expressed genes
(vvl, nab, cas, tkr, and grh). B: Additional late sub-lineage neural precursor cell enhancers were also identified in cis-Decoder CSC database
searches using CSBs from different NB enhancer CSCs as input (vvl-41 and sqz-11, identified via the cas-8 CSBs; ct-3, using the pdm-2 gene NB
enhancer CSBs [Berman et al., 2004]; and ct-14, using the cg6559-28 CSBs [Fig. 5A]). Shown are dissected fillets of whole-mount-stained
embryos, stages 10-12 (left to right, respectively, anterior up).
that represent sub-patterns of endogenous
cas expression (Figs. 2, 5 and
data not shown). 

Our studies also revealed that there 
is no apparent consistency in the 
ordering, overlap, or orientation of 
shared elements between functionally 
related enhancers. For example, RPS 
elements shared between cas-6, 
cg7229-5, and grh-15 appear in 
unique contexts within each enhancer 
(Supp. Fig. S1). This lack of consistency
in positioning of shared elements
ments has also been noted in early 
sub-lineage NB enhancers (Brody 
et al., 2008). 



Unbalanced RPS Elements 
Indicate Different Enhancer 
Regulatory Behaviors 

During the functional analysis of data- 
base CSCs that share RPS elements 
with cas-6, one of the CSCs, vvl-43 
(see Fig. 1B for EvoPrint profile), was
found to share 92 RPS and unique 
sequence elements with cas-6 (Fig. 
6A). It did not, however, drive trans- 
gene reporter expression in NBs but 
activated expression instead in the 
embryonic ectoderm (Fig. 6C and Ta- 
ble 1). cis-Decoder analysis of the 
shared RPS elements revealed that 
the balance of RPS elements was
markedly different between cas-6 and 
vvl-43 (Fig. 6B). Notable is the large 
number of conserved HOX motifs 
within vvl-43 in comparison to cas-6. 
Expression of vvl-43 in the embryonic 
ectoderm is segmental, and although 
temporally late, there is no embryonic 
CNS expression (Fig. 6C). Previous 
studies demonstrate that the vvl- 
encoded protein, a POU homeodomain 
factor, is expressed in the CNS and in 
the ectoderm of embryos, suggesting 
that vvl-43 functions as an ectodermal 
enhancer for vvl expression (Anderson 
et al., 1995; also see figure 9A of Kam- 
badur et al., 1998). The disparity of 
shared element frequencies between 
cas-6 and vvl-43 (Fig. 6B) is in 
marked contrast to the similarity of 
frequencies when comparing cas-6 
and cg7229-5 (Fig. 3C). That lack of 
balance in shared element copy num- 
bers between enhancers suggests that 
they may have different regulatory 
behaviors. 



Another example of how unbal- 
anced RPS elements indicate func- 
tionally different enhancers can be 
seen in the comparative analysis of 
vvl-41 with vvl-43 CSCs (EvoPrints 
are shown in Fig. 1B). Like the previous
comparisons to cas-6, the vvl-41
and vvl-43 CSCs share similar ele- 
ments (Fig. 7); vvl-41 shares 96 RPS 
and unique elements with vvl-43 
CSCs, and 68% of the vvl-43 con- 
served sequences are covered by these 
shared elements (data not shown). 
Although these two CSCs have exten- 
sive overlap of shared elements, the 
repeat balance index and correlation 
coefficient reveal that their shared ele- 
ments are not balanced in copy num- 
ber (Fig. 7A and data not shown). Con- 
sistent with the imbalance in their 
shared elements, these enhancers 
displayed markedly different regula- 
tory behaviors in the embryo (Figs. 
5B, 6C). Likewise, these two
enhancers drive reporter expression 
in different sets of larval neurons. 
Whereas most of the cells expressing 
the vvl-41 reporter transgene are sub- 
esophageal ganglion interneurons, 
vvl-43 enhancer drives reporter 
expression in a subset of ventral cord 
motor neurons (Fig. 7B). Thus the 
presence of identical elements in dif- 
ferent clusters does not necessarily 
lead to similar regulatory behaviors, 
and comparing shared element copy- 
numbers has a better predictive value 
for determining enhancer behavior. 
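The idea that copy-number balance, rather than mere sharing, predicts related function can be illustrated with a correlation of copy numbers over the shared elements. The exact formula behind the cis-Decoder correlation coefficient is not given in this section, so the sketch below assumes a plain Pearson correlation. The function name and all copy numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of copy numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical copy numbers of four shared elements in an input enhancer
# and in two database CSCs, for illustration only.
input_copies = [3, 2, 4, 2]
balanced_db = [3, 2, 4, 3]
unbalanced_db = [1, 6, 1, 5]

r_balanced = pearson(input_copies, balanced_db)
r_unbalanced = pearson(input_copies, unbalanced_db)
```

Both hypothetical database CSCs share all four elements with the input, yet only the first yields a positive correlation, which mirrors the contrast drawn above between shared-element counts and shared-element balance.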



cis-Decoder Searches Identify 
Novel Sequence Elements 
Present in Other Families of 
Functionally Related 
Enhancers 

To further test the ability of cis-De- 
coder database searches to identify 
different families of functionally 
related enhancers and to compare our 
search protocols to other enhancer 
search algorithms, we initiated data- 
base searches with different well- 
characterized enhancer types. Using 
the Krüppel gap enhancer Kr_CD1
(Hoch et al., 1990), we identified the
giant gt_(-10) enhancer (Schroeder
et al., 2004) (Fig. 8A). Besides sharing 
HOX sites with different flanking 
bases (Fig. 8A), the two enhancer 
CSCs also share a 14-bp sequence, 



TGAACTAAATCCGG (see boxed 
sequence in Fig. 8A). Remarkably, 
this 14-bp element within the Krüppel
enhancer was identified as a site of 
competitive binding by the activator 
Bicoid and the repressor Knirps tran- 
scription factors (Hoch et al., 1992). 
The conservation of interlocking or 
overlapping docking sites for Bicoid 
and Knirps within both of these gap 
enhancers supports the contention 
that large CSBs (containing 7 to 10 bp 
or more) most likely function as the 
point of integration of multiple tran- 
scription factors in the regulation of 
enhancer behavior. 
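Finding a long element shared by two enhancers, such as the 14-bp sequence above, can be sketched as a longest-common-substring search that also considers the reverse-complement orientation. This is a generic dynamic-programming approach and is not claimed to be the cis-Decoder alignment algorithm. The flanking bases around the 14-bp element in the example are invented for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def longest_common_substring(a, b):
    """Longest contiguous string present in both a and b (first one found on ties)."""
    best_len = 0
    best_end = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the match ending at (i-1, j-1)
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_end = i
        prev = cur
    return a[best_end - best_len:best_end]


def longest_shared_element(a, b):
    """Longest element of a that occurs in b in either orientation."""
    a, b = a.upper(), b.upper()
    forward = longest_common_substring(a, b)
    reverse = longest_common_substring(a, reverse_complement(b))
    return forward if len(forward) >= len(reverse) else reverse


# The 14-bp element from the text, embedded in hypothetical flanking bases.
seq_a = "CCGT" + "TGAACTAAATCCGG" + "ACAC"
seq_b = "GATTC" + "TGAACTAAATCCGG" + "TTGA"
shared = longest_shared_element(seq_a, seq_b)
```

The same element is recovered when the second sequence is supplied in the opposite orientation, since the reverse-complement comparison is made automatically.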

Our search using the Kr_CD1 also
identified the kni_(+1) intronic gap
enhancer (Schroeder et al., 2004).
Shared sequence motifs between
Kr_CD1 and kni_(+1) include multiple
polyA/polyT motifs, presumably
targets of Hunchback, that are found 
in even balance (five copies) between 
the two enhancers (Fig. 8B). Other 
shared sequences include several 
HOX-binding sequence elements. 

Previous work has shown that 
many segmentation genes utilize mul- 
tiple enhancers that regulate gene 
expression in nearly identical pat- 
terns (reviewed by Hobert, 2010). 
These enhancer pairs have been 
termed (1) primary enhancers, found 
closely associated with the transcrip- 
tional start site, and (2) "shadow" 
enhancers, found at a distance from 
the structural gene. Starting with the 
primary vnd ventral neuroectoderm 
enhancer CSC (Hong et al., 2008), a 
cis-Decoder search identified its 
shadow enhancer based on the bal- 
anced copy number appearance of its 
RPS elements and uniquely shared 
sequences (Supp. Fig. S2; and data 
not shown). In addition to other 
shared elements, both of these 
enhancers contain 2 copies of the 
CACATGA bHLH motif, which 
matches the optimal DNA-binding 
site for the transcriptional regulator 
Twist (Ozdemir et al., 2011). 

We next tested the cis-Decoder 
search algorithms to see if it would be 
possible to detect enhancers regulated 
by Notch signaling (Nellesen et al., 
1999). Previously identified Notch- 
targeted enhancers include those 
associated with the E(spl) complex 
genes. Multiple alternative binding 
sites within these enhancers have 
[Fig. 6B panel: vvl-43 enhancer showing a repeat balance map with cas-6 enhancer RPS elements]

Fig. 6. Enhancers that share unbalanced repeat elements between their CSCs carry out distinct regulatory functions. A: cis-Decoder alignments
between the cas-6 enhancer and vvl-43 CSBs identified 93 different unique (blue), repeat (red), and shorter truncated-repeat (orange) sequence
elements that were common to each CSC (green underline indicates the cas-6 repeat that was used to initiate the cis-Decoder CSC database
search). B: The vvl-43 CSC relaxed EvoPrint was highlighted to show repeat element frequencies relative to the cas-6 enhancer (see color coding
in Fig. 4). C: vvl-43 CSC enhancer-reporter transgene expression analysis (Gal4-reporter mRNA in situ hybridization) reveals that, unlike the cas-6
enhancer (Fig. 2A), vvl-43 activates reporter expression in a subset of ectodermal cells during stage 11 and no reporter expression was detected
in CNS NBs. Shown are filleted-flattened preparations of whole-mount-stained embryos, embryonic stages 11-14 (anterior up).
[Fig. 7A panels: pie chart; Repeat Balance Map for database CSC vvl-43 when aligned with input cluster vvl-41]

Fig. 7. vvl-41 and vvl-43 enhancers exhibit an imbalance in copy number of their shared elements as evidenced by the low level of perfectly
matched sequences. A: Shown are a pie chart and a vvl-43 CSC repeat balance map that illustrate the relative copy number balance of shared
elements between vvl-41 and vvl-43 (see Fig. 4 for ratio map color code). B: vvl-41 and vvl-43 CSCs function as larval neural enhancers that drive
the expression of a membrane-bound GFP-CD8 reporter in different sets of CNS neurons. Shown are dissected cephalic lobes and ventral cords
from wandering third-instar larva (dorsal views, anterior up).
been identified for Suppressor of
Hairless [Su(H)], the transcription 
factor utilized by the Notch pathway 
(Bailey and Posakony, 1995; Castro 
et al., 2005). We initiated a cis-De- 
coder search with one of the CSCs 
(Espl-1) to discover other similarly 
structured CSCs, using as required 
sequences a single Su(H)-binding site 
(TGGGAA) and a single bHLH-bind- 
ing site (CAGCTG). This search 
resulted in 101 database hits, includ- 
ing CSCs from known Su(H) targets 
m2, m6, and mγ (Castro et al., 2005)
as well as putative enhancers for the 
neural determinants Dichaete, dead- 
pan, nervy, tailless, castor, Fps85D, 
Notum, and extra macrochaetae (data 
not shown). In addition, searching 
with the Notch-targeted deadpan NB 
enhancer (San-Juan and Baonza, 2011; 
cis-Decoder CSC dpn-3), that contains 
two alternative Su(H)-binding sites 
(GTGAGAA; Bailey and Posakony, 
1995; Lecourtois and Schweisguth, 
1995; Nellesen et al., 1999), we identi- 
fied other putative Notch pathway 
targeted enhancers: CG7229-5, cas-8, 
an HLHmβ-associated CSC (HLHmbeta-2),
and the m4 PNS enhancer (Nellesen
et al., 1999). Thus, cis-Decoder
searches can identify functionally 
related enhancers that regulate gene 
expression during different phases of 
development and in different tissues. 
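A search that requires more than one mandatory element, as in the Su(H) plus bHLH search above, can be sketched as a filter over a dictionary of required sites and minimum copy numbers. The two site sequences are taken from the text. The function names and the toy CSC database are hypothetical and do not represent the cis-Decoder interface.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def count_sites(csbs, site):
    """Count a site on either strand across CSBs; palindromic sites are counted once."""
    site = site.upper()
    targets = {site, reverse_complement(site)}
    k = len(site)
    return sum(
        1
        for block in csbs
        for i in range(len(block) - k + 1)
        if block.upper()[i:i + k] in targets
    )


def search_required(database, required):
    """Names of CSCs that meet every minimum copy number in `required`."""
    return sorted(
        name
        for name, csbs in database.items()
        if all(count_sites(csbs, site) >= n for site, n in required.items())
    )


# Required sites from the text: one Su(H) site and one bHLH site.
required_sites = {"TGGGAA": 1, "CAGCTG": 1}

# Toy database (hypothetical sequences, for illustration only).
toy_db = {
    "both_sites": ["AATGGGAATT", "GGCAGCTGAA"],
    "su_h_only": ["TTCCCAGG"],
    "ebox_only": ["CAGCTG"],
}
hits = search_required(toy_db, required_sites)
```

Only the CSC carrying both required sites passes; each of the other two lacks one of them and is dropped.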



Many Enhancers Regulate 
Gene Expression During 
Multiple Phases of 
Development 

Each of the embryonic NB enhancers
identified above was also tested
for regulatory activity during later 
stages of development, and many 
were observed to activate transgene 
reporter expression in the third instar 
larva and/or adult CNS. Three of the 
tested enhancer transgene reporters, 
cg6559-28, grh-15, and tkr-15 exhib- 
ited expression in a similar pattern 
within brain neural precursor cells, 
thoracic neuromeres and posterior 
neural precursors of the third-instar
larva CNS, while the cas-6 and cas-8 
enhancers were not active in larvae 
(Fig. 9A; Table 2; and data not 
shown). The ct-3 and ct-14 CSCs 
drove expression in small subsets of 
neurons in the sub-esophageal gan- 



glion and in the ventral cord abdomi- 
nal neuromeres (data not shown). 
Additionally, nab-1 expression was 
similar to that of the dnab e310 
enhancer-trap expression in third- 
instar larvae CNS (Clements et al., 
2003; data not shown). In the adult, 
many of the enhancers were 
expressed in a subset of central brain 
neurons, and in the optic lobe. Specifi- 
cally, cg6559-28, vvl-14, and nab-1 
reporters were expressed in the 
mushroom body (Fig. 9B). While cas-6 
was not expressed in the adult 
brain, cas-8 reporter expression was 
detected in the ellipsoid body in a pat- 
tern similar to cas adult expression 
(Hitier et al., 2001; data not shown). 
In addition to analyzing the 14 CSCs 
listed in Table 2, we also examined 
the embryonic and adult reporter 
expression of another 60 CSCs, cho- 
sen by a variety of criteria. Many of 
these activate transgene reporter 
expression in both the embryonic and 
adult CNS (data not shown). Given 
the fact that CSC sub-regions of these 
multiuse enhancers have not been 
tested for reporter activity, we cannot 
rule out the possibility that different 
regions within the cluster have auton- 
omous functions and represent discrete 
enhancers. However, our functional 
analysis of the nerfin-1 NB enhancer 
and the cas-6 enhancer CSCs has 
revealed that full enhancer function 
requires the complete cluster (Kuzin 
et al., 2009 and unpublished experi- 
ments). The EvoPrinter algorithm pro- 
vides a methodology for testing for 
the close apposition of independent 
enhancers (Kuzin et al., 2009). 



Dissecting P-Element 
Enhancer-Trap Line 
Expression Patterns 

Previous cis-regulatory analysis of 
genomic regions flanking enhancer- 
trap insertion sites has revealed 
the basis of enhancer-trap expression 
in terms of flanking endogenous 
enhancers regulating P-element 
reporter transgenes (O'Kane and Gehr- 
ing, 1987). P-element Gal-4 enhancer- 
trap lines have been used extensively 
to drive transgene expression during 
development (reviewed by Hummel 
and Klambt, 2008). Although this 
approach has been of great utility, 



many of the Gal4 driver lines are of 
limited use due to their broad expres- 
sion patterns, which is most likely 
due to multiple tissue/temporal spe- 
cific enhancers regulating Gal4 trans- 
gene expression. In addition to the 
discovery of new enhancers, cis- 
Decoder tools can be used to identify 
and differentiate between enhancers 
that flank the insertion site of P-element
enhancer-trap constructs. For
example, the enhancer-trap Gal4 line 
c492a activates UAS-transgene expres- 
sion in a subset of neurons within the 
adult mushroom body, in the antenna 
lobe, and in a subset of neurons in the 
central brain (Armstrong and Kaiser, 
1997; Flytrap web site: http://www.fly-trap.org/).
The c492a P-element
insertion site was localized to the 4th 
intron of an uncharacterized gene, 
cg32264 (see Supp. Fig. S3A). cis-Decoder
analysis of CSCs in the vicinity
of the insertion site revealed a candi- 
date CSC, cg32264-76, with sequence 
properties suggestive of a neural 
enhancer. RPS analysis revealed that 
cg32264-76 contains two extended 
octamer motifs [TTATGCAAAT] (Supp. 
Fig. S3B). The cg32264-76 reporter 
expression pattern corresponds to a 
subset of neurons marked by the 
c492a enhancer-trap GAL4 reporter. 
The adult expression pattern included 
a large neuron that extends dendrites 
into the optic lobe (Supp. Fig. S3C 
and Table 2). This suggests that the 
combined EvoPrinter and cis-Decoder 
analysis will help in the identification 
of specific enhancers to further refine 
transgene expression. 

Criteria for Identifying and 
Evaluating Related CSCs 

Although each of the cis-Decoder 
scorecard indices provides useful in- 
formation in judging the relationship 
of the input enhancer to database 
CSCs, we have found the repeat bal- 
ance index and the correlation coeffi- 
cient (see criteria 1 and 2 above) are 
more accurate indices when searching 
for functionally related enhancers, 
since they take into account not only 
the number of shared elements but 
also the RPS copy number balance 
between the input enhancer and data- 
base CSC. The percent alignment cov- 
erage is likewise an important indica- 
tor of the relationship between the 
[Fig. 8 panels: A, CSB alignments between the Kr_CD1 and gt_(-10) gap enhancers; B, kni_(+1) gap enhancer repeat balance map with Kr_CD1 enhancer conserved elements]

Fig. 8. cis-Decoder CSC database searches identify shared conserved sequence elements among cellular blastoderm gap enhancers. A: CSB alignments
between the Krüppel and giant gap enhancers (Kr_CD1, Hoch et al., 1992; gt_10, Schroeder et al., 2004) identify 42 distinct conserved sequence
elements of 6 bp or greater that represent 55.62% of the conserved bases within the gt_10 enhancer. The red boxed 14-bp sequence corresponds
to the characterized overlapping Knirps and Bicoid transcription factor binding sites that are required for wild-type Kr_CD1 enhancer regulatory
behavior (Hoch et al., 1992). B: The relaxed EvoPrint of the knirps gap enhancer CSC (kni_(+1); Schroeder et al., 2004) was highlighted to show the
frequencies of shared RPS and unique elements present in the Kr_CD1 gap enhancer (see color coding in Fig. 4 for RPS balance index).



Fig. 9. Many NB enhancers that regulate gene expression during embryonic
CNS development also activate gene expression during adult development
and in the adult nervous system. A: During third-instar larval
development, the cg6559-28, grh-15, and tkr-15 CSC enhancer-Gal4 driver
transgenes activate membrane-bound GFP-CD8 tagged transgene expression
in sub-regions of the cephalic lobes and in thoracic ventral cord neural
precursor cells. Shown are dorsal views of dissected CNS preparations
from wandering third-instar larvae (anterior up). B: In the adult brain,
the cg6559-28, vvl-14, and nab-7 enhancers drive GFP-CD8 reporter
expression in neurons whose cell bodies reside in the mushroom body
calyxes and in different regions of the sub-esophageal ganglion. Shown
are confocal optical sections of GFP immunostained adult brains (frontal
views) at the level of the mushroom bodies and the sub-esophageal
ganglion.




input and database CSCs. Thus, sorting the scorecard by the repeat
balance index or by the correlation coefficient increases the likelihood
that functionally related enhancers rank at the top of the list. For
example, all of the late temporal NB enhancers identified in this study
had repeat balance index scores greater than 1.0, correlation
coefficients above 0.4, and percent coverage of >40%.
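
The cutoffs quoted above can be applied as a simple filter and sort over a scorecard. The following is a minimal sketch, not the cis-Decoder implementation: the `Hit` record, its field names, the `shortlist` function, and the example rows are all hypothetical, and only the three threshold values come from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One scorecard row; the field names are hypothetical."""
    csc: str
    balance_index: float   # repeat (RPS) balance index
    correlation: float     # correlation coefficient
    coverage: float        # percent coverage, 0-100


def shortlist(hits, min_balance=1.0, min_corr=0.4, min_cov=40.0):
    """Keep hits above all three cutoffs, highest balance index first."""
    kept = [
        h for h in hits
        if h.balance_index > min_balance
        and h.correlation > min_corr
        and h.coverage > min_cov
    ]
    return sorted(kept, key=lambda h: (h.balance_index, h.correlation), reverse=True)


# Invented example rows, for illustration only.
hits = [
    Hit("csc-A", 1.8, 0.62, 55.0),
    Hit("csc-B", 0.7, 0.80, 61.0),  # fails the balance index cutoff
    Hit("csc-C", 2.4, 0.45, 41.5),
    Hit("csc-D", 1.2, 0.30, 70.0),  # fails the correlation cutoff
]
ranked = [h.csc for h in shortlist(hits)]
```

Sorting on the balance index first and the correlation coefficient second is one possible choice; the text only states that sorting by either metric tends to bring related enhancers to the top.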

To estimate the number of false-positive predictions and of functionally
related enhancers that were missed in cis-Decoder searches, we used
cas-6 as the input enhancer (for search conditions see above). The
search returned 111 database hits, of which 27 that shared many repeat
elements with cas-6 were tested for enhancer activity in flies. Of
these, 12 proved to be late temporal network enhancers, each expressed
in a different subset of midline, brain, and/or ventral cord
neuroblasts. Eleven were expressed exclusively in adult brain, larval
precursors, or embryonic neurons, and four were considered negative,
since their reporter expression was undetectable or found in tissues
other than the nervous system. As for enhancers that were missed in the
search, we have identified late temporal network enhancers that do not
contain three or more complete or partial octamer sequences, or that do
not score highly using cas-6 as input. The low-scoring enhancers
included sqz-11 and vvl-41, which were discovered using cas-8 as the
input CSC (mentioned above). Likewise, ct-3 and ct-14 did not contain
three octamer sequences, yet they also proved to be late temporal
network NB enhancers. Finally, we have identified five other late
temporal network enhancers that do not contain octamer motifs but do
contain other repeated elements found in late temporal network
enhancers (data not shown). It is clear from these results that a search
for enhancers using a mandatory sequence, such as the octamer motif, is
insufficient to detect the full genomic repertoire of late temporal
network enhancers. To identify as many functionally related enhancers as
possible, multiple database searches using different search criteria are
recommended. Our current understanding of the role of octamer motifs in
conferring temporal gene expression is incomplete, in that we are unable
to fully distinguish between embryonic late temporal network enhancers
and octamer-site-rich larval or adult brain enhancers. Nevertheless, the
fact that only four of the 27 clusters tested were not expressed in the
CNS speaks to the efficacy of cis-Decoder search algorithms in detecting
neural enhancers.
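
The proportions behind this statement follow directly from the counts reported above. As a small worked tally (the dictionary keys and the `summarize` helper are hypothetical names; the counts 12, 11, and 4 are those given in the text):

```python
def summarize(outcomes):
    """Return the number tested and the percentage in each broad class."""
    tested = sum(outcomes.values())
    neural = tested - outcomes["negative"]
    return {
        "tested": tested,
        "percent_neural": round(100 * neural / tested, 1),
        "percent_late_temporal": round(
            100 * outcomes["late_temporal_nb"] / tested, 1
        ),
    }


# Counts as reported for the 27 tested hits from the cas-6 search.
cas6_outcomes = {"late_temporal_nb": 12, "other_neural": 11, "negative": 4}
summary = summarize(cas6_outcomes)
```

With these counts, 23 of the 27 tested clusters (about 85%) drove expression in the nervous system, and 12 of 27 (about 44%) behaved as late temporal network enhancers.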

Ideally, it would be useful to make direct comparisons of the
cis-Decoder algorithm with other web-based tools for discovery and
analysis of cis-regulatory elements. However, not all search programs
use evolutionary comparisons, and those that do use different levels of
evolutionary divergence to identify conserved sequences in enhancers.
The comparative analysis of enhancer discovery programs nevertheless
points to factors present in various computational formats that appear
to be important for successful cis-regulatory element prediction
(discussed in Su et al., 2010). These include sequence conservation
between related species, motif clustering, and availability of prior
information on the presence of known transcription factor-binding sites.
In this context, combined use of cis-Decoder methodology with ChIP-Seq
data that show occupancy of cis-regulatory modules by specific
transcription factors (Zinzen et al., 2009; Wilczyński and Furlong,
2010) will improve identification of functional motifs within enhancers
that are bound by specific transcription factors, and will resolve
additional functionally important flanking sequences. The libraries of
repeat and uniquely shared sequences generated by cis-Decoder are useful
for sub-structural analysis of enhancers; for example, discovery of the
unique element shared by Krüppel and giant gap enhancers demonstrates
the ability of cis-Decoder to reveal combinatorial interactions by
analysis of blocks of conserved sequences. Other aspects of
cis-regulatory biology will also be relevant; for example, the
configuration of the chromatin as detected by DNase I hypersensitivity
indicates accessibility of enhancer sequences to transcriptional
regulators (reviewed in Suganuma and Workman, 2011). Knowledge of
chromatin state is invaluable for prediction of enhancer activity, and
information concerning specific CSCs can be accessed via the UCSC
browser.

The efficacy of cis-Decoder in predicting enhancers can be compared to a
study that used known cis-regulatory modules to develop a training set
of computationally predicted transcription factor-binding sites to
predict genomic cis-regulatory modules (Rouault et al., 2010). That
study predicted neural expression of the same cg7229 enhancer that was
identified using cis-Decoder (Fig. 3). Likewise, an algorithm known as
Ahab, which uses transcription factor-binding site information for
known regulators of cellular blastoderm enhancers, successfully
predicted the gt_(-10) and kni_(+1) gap enhancers (Schroeder et al.,
2004) that also scored highly in our search using the Kr_CD1 gap
enhancer as the input CSC (Fig. 8). It is important to point out that
cis-Decoder search protocols make direct use of CSC information for
enhancer prediction, while other resources, such as Genome Surveyor
(Kazemian et al., 2011), use site conservation as a criterion but do not
provide information to infer enhancer boundaries. Given that multiple
enhancer prediction programs that employ different search criteria are
available, it would be advisable to employ several discovery programs
(summarized by Su et al., 2010) before settling on a final list of
candidate genomic regions for analysis in enhancer-reporter transgenic
studies.

CONCLUSIONS 

We have generated a Drosophila genome-wide database of evolutionarily
conserved DNA sequences that allows for discovery of functionally
related enhancers. A cis-Decoder search identifies database CSCs that
share balanced conserved sequence elements with an input enhancer. No
prior information about the functional significance of DNA sequences
within enhancers is required to identify other related enhancers. The
database provides an inventory of conserved repeat sequences within CSCs
and enables comparison between input and database CSCs by various
metrics that allow the user to judge CSC similarity. Starting with a
temporally restricted NB enhancer, we have shown that cis-Decoder can
successfully identify other similarly regulating enhancers, and we also
demonstrate how other functionally distinct enhancer families can be
identified.

Our comparative analysis of the enhancers described in this report and
an additional 60 enhancers has yielded the following observations
concerning enhancer structure and behavior: (1) Functionally related
enhancers can be identified based on their balanced copy numbers of
shared conserved repeat elements. (2) Enhancers that have extensive
shared conserved sequence elements (often >60%), but do not have
balanced shared repeat copy numbers, may display significantly different
regulatory behaviors. (3) Shared repeat and unique elements between
functionally related enhancers are not found in any fixed order or
orientation. (4) Similarly regulating families of enhancers need not
share specific sets of conserved sequence elements, since different
enhancers can accomplish the same regulatory behavior with different but
overlapping sets of conserved elements. (5) Enhancers that share
conserved repeat elements and perform related cis-regulatory functions
also contain unique sets of repeat elements that are only partially
shared with other related enhancers.

Our observations have revealed that Drosophila CNS developmental
enhancers are highly complex, based on their conserved sequence
composition, and many have proven to be multifunctional. The observed
complexity of enhancers, specifically with regard to multi-copy repeat
motifs, also suggests that enhancer function is realized through a
complex process involving combinatorial interactions among many factors
and cannot be easily explained by single activator/repressor
transcription factor switches. In addition, the fact that functionally
diverse enhancers can display such extensive overlap in their conserved
sequences underscores the combinatorial complexity of cis-regulation
(also see Southall and Brand, 2009). Because of the lack of fixed order
and orientation of shared elements between related enhancers, only the
alignment flexibility of the cis-Decoder CSB aligner can rapidly detect
the extent and makeup of shared conserved sequences between different
enhancers. Until now, enhancer boundaries have, for the most part, been
resolved by reporter transgene deletion analysis. The addition of
evolutionary clustering of conserved sequences to this identification
process will aid in enhancer identification and allow for an assessment
of their structure and spatial constraints. cis-Decoder algorithms also
allow one to generate libraries of conserved sequence elements that are
shared among enhancers; this dataset will be useful for understanding
the combinatorial complexity of tissue-specific gene regulation.

EXPERIMENTAL 
PROCEDURES 

Conserved Sequence Cluster 
EvoPrints 

EvoPrint conditions for identifying the database CSCs are described in
the Results and Discussion section. For the CSCs used as examples in the
text and figures, relaxed EvoPrints were prepared using D. melanogaster
DNA as the reference sequence and 11 orthologous DNA sequences from the
D. simulans, D. sechellia, D. yakuba, D. erecta, D. ananassae, D.
pseudoobscura, D. persimilis, D. willistoni, D. virilis, D. mojavensis,
and D. grimshawi species.

Enhancer Search Protocol 

cis-Decoder (http://cisdecoder.ninds.nih.gov/public.do) programs consist
of an integrated set of search and alignment algorithms that help
discover conserved sequence elements that are shared between similarly
regulated enhancers. The following is a description of the sequential
steps and accompanying algorithms used by the cis-Decoder protocol to
identify repeat and palindrome elements within the input cluster and to
scan a genomic CSC database for other CSCs that contain the conserved
sequence elements present in the input CSC.

The Drosophila CSC database is based on relaxed EvoPrints (Yavatkar
et al., 2008; http://evoprinter.ninds.nih.gov/) of >90% of the
Drosophila genome. Clusters were identified and named using EvoPrint
cutter, an ImageJ macro written by Wayne Rasband of NIMH. CSBs were
extracted in forward and reverse directions, and a database repository
was populated with CSC details including a list of CSBs and records of
repeat and palindromic elements and their number within each cluster as
well as across all the clusters in the database.

Upon input of a user-provided CSC, cis-Decoder extracts and annotates
the CSBs in both forward and reverse directions and discovers repeat and
palindromic sequences within the input CSC. The system also invokes a
database search to find the occurrences of each repeat in the database
repository, and displays the number of database clusters containing more
than one copy of each repeat. The user sets search constraints to limit
the search to a specific list of repeat motifs as well as to a sequence
type (non-coding, coding, or 3' UTR).
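
The repeat and palindrome discovery step can be sketched as follows. This is an illustration under stated assumptions rather than the cis-Decoder code: it uses a fixed word length of 6 bases (the program itself reports elements of varying length), it counts a word and its reverse complement as the same element, and the function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(word):
    """Represent a word and its reverse complement by a single key."""
    return min(word, reverse_complement(word))


def count_elements(csbs, k=6):
    """Count every k-base word across the CSBs, ignoring orientation."""
    counts = Counter()
    for csb in csbs:
        csb = csb.upper()
        for i in range(len(csb) - k + 1):
            counts[canonical(csb[i:i + k])] += 1
    return counts


def repeats_and_palindromes(csbs, k=6):
    """Return (repeats with copy number > 1, set of palindromic words)."""
    counts = count_elements(csbs, k)
    repeats = {word: n for word, n in counts.items() if n > 1}
    palindromes = {word for word in counts if word == reverse_complement(word)}
    return repeats, palindromes


# Invented CSBs: the second contains reverse-orientation copies of words
# in the first, and the third is a palindrome.
repeats, palindromes = repeats_and_palindromes(
    ["TTAATCCAGG", "GGATTAAC", "GAATTC"]
)
```

Collapsing each word with its reverse complement is one way to make the count independent of the direction in which a CSB was extracted.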

cis-Decoder then searches the database repository for clusters that
conform to the search constraints. An alignment between the input and
database cluster is generated that shows alignments to input enhancer
multi-copy repeat motifs along with input enhancer repeats that were
excluded from the database search, unique alignments, and database
cluster repeat sequences found as a subset of the unique alignments.
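
Because shared elements need not occur in any fixed order or orientation, an alignment of this kind can be thought of as a list of independent matches. The sketch below locates each input element in the database CSBs on either strand; the function names, the tuple layout, and the "F"/"R" strand labels are assumptions made for illustration, not the program's output format.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_all(text, probe):
    """Yield every start index of probe in text, including overlaps."""
    start = text.find(probe)
    while start != -1:
        yield start
        start = text.find(probe, start + 1)


def align_elements(input_elements, db_csbs):
    """List (csb_index, start, strand, element) for each match found."""
    hits = []
    for idx, csb in enumerate(db_csbs):
        for element in input_elements:
            for start in find_all(csb, element):
                hits.append((idx, start, "F", element))
            rc = reverse_complement(element)
            if rc != element:  # a palindrome reads the same on both strands
                for start in find_all(csb, rc):
                    hits.append((idx, start, "R", element))
    return sorted(hits)


# Invented database CSBs and input elements, for illustration only.
matches = align_elements(["TAATCC", "CAGGTAG"], ["CAGGTAGAA", "GGATTAAC"])
```

Each match is reported on its own, so elements that appear in a different order or on the opposite strand in the database cluster are still recovered.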

Various alignment scores are computed to rank the database CSCs; scores
include the RPS Balance Index (a measure of the relative balance of the
perfectly matched shared RPS elements to the shared RPS elements that
are not matched in frequency between input and database CSC), the
Pearson Correlation Coefficient (a measure of the strength of a linear
relationship between the relative occurrence of repeat and unique
matches in the input vs. the database cluster), Number of Shared
Repeats, Total Shared Elements (including repeat, palindromic, and
uniquely shared elements), Percent Coverage (the percent of conserved
bases in the database CSC aligning to RPS and unique matches in the
input enhancer CSBs), Number of Required Repeats (set as a search
criterion), Longest Shared Sequence, and the number of Conserved Bases
in the database CSC. cis-Decoder then generates lists of shared
sequences of input and database CSCs, and generates a "repeat balance
map" visual representation of the relative frequency of appearance of
the RPSs in the input versus database CSC (see Fig. 4 legend for
details; when repeats overlap, the balance ratio designation of the
longer repeat is indicated, and when two repeats of equal size overlap,
the more balanced ratio is indicated). In addition to the repeat balance
map, a representation of the balance of shared elements within the
database CSC in comparison to the input is shown in the form of a pie
chart.
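
Three of these scores can be sketched from per-element copy numbers. The exact formulas are not given in this section, so the code below is only one plausible reading: the balance index is taken as the ratio of shared elements with equal copy number to shared elements with unequal copy number, the correlation is computed over the copy numbers of shared elements, and all function names and example counts are hypothetical.

```python
import math


def balance_index(input_counts, db_counts):
    """Ratio of frequency-matched to frequency-unmatched shared elements
    (an assumed reading of the RPS Balance Index)."""
    shared = set(input_counts) & set(db_counts)
    if not shared:
        return 0.0
    matched = sum(1 for e in shared if input_counts[e] == db_counts[e])
    unmatched = len(shared) - matched
    return math.inf if unmatched == 0 else matched / unmatched


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 where it is undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def copy_number_correlation(input_counts, db_counts):
    """Correlate copy numbers of the elements present in both clusters."""
    shared = sorted(set(input_counts) & set(db_counts))
    return pearson([input_counts[e] for e in shared],
                   [db_counts[e] for e in shared])


def percent_coverage(aligned_bases, conserved_bases):
    """Percent of database CSC conserved bases covered by shared elements."""
    return 100.0 * aligned_bases / conserved_bases if conserved_bases else 0.0


# Invented copy numbers, for illustration only.
input_counts = {"TAATCC": 3, "CAGGTAG": 2, "AGATTA": 1, "GCGAAA": 2}
db_counts = {"TAATCC": 3, "CAGGTAG": 2, "AGATTA": 2, "TTTTTT": 5}
balance = balance_index(input_counts, db_counts)
r = copy_number_correlation(input_counts, db_counts)
```

Under this reading, a balance index above 1.0 means that more shared elements agree in copy number than disagree, which is consistent with the cutoff reported for the late temporal NB enhancers.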

Enhancer-Reporter 
Transgene Construction 

Fragments for cloning were generated by standard PCR protocols. Primer
sequences are provided in Table S1. PCR-amplified genomic fragments were
cloned into the Invitrogen pCRII-TOPO vector, sequenced for verification
of the insert, and recloned into a site-specific integration vector,
Bullfinch (Fig. S3), consisting of a modified pCa4B vector (Markstein
et al., 2008) with an inserted polylinker site, minimal Heat shock
protein 70 promoter (from the pRed H-Stinger vector; Barolo et al.,
2004), Gal4 ORF (from S. cerevisiae), and SV40 3'UTR (from the pRed
H-Stinger vector; Barolo et al., 2004). The Bullfinch vector map is
shown in Supp. Fig. S4. Details of the cloning steps and vector sequence
are available upon request.

Drosophila Stocks and 
P-Element Transformations 

Third chromosome site-specific P-element integration transformants were
generated in the y, w; y+[attP2] strain as previously described (Groth
et al., 2004; Markstein et al., 2008) using our Gal4 site-specific
vector (see above). Embryo Gal4 mRNA in situ hybridizations were
performed on multiple independent transformant lines for each construct
to ensure reproducible expression.

Embryo Transgene Reporter 
Expression Localization 

Embryo collection and fixation were performed according to the
procedures described by Patel (1994). For in situ hybridization,
riboprobes were prepared from a PCR-amplified Gal4 ORF within the Gal4
Bullfinch vector. The Roche (Indianapolis, IN) DIG RNA Labeling Mix
protocol was used, and staining was visualized using anti-FITC Fab
fragments coupled to alkaline phosphatase. Transgene-reporter expression
for each of the enhancers was examined in at least two independent
transgene-reporter transformant lines. cas mRNA transcripts were
detected by in situ hybridization using a cas ORF digoxigenin probe
generated from a PCR-amplified genomic fragment. More detailed protocols
for embryo processing and in situ hybridization are available upon
request. After whole-mount in situ hybridization, embryos were filleted,
viewed in 70% glycerol with 30% phosphate-buffered saline (PBS), and
photographed using a Nikon (Melville, NY) microscope equipped with
Nomarski (DIC) optics. Embryo developmental stages were determined by
morphological criteria (Campos-Ortega and Hartenstein, 1985).

Immunohistochemistry and 
Confocal Imaging of Larval 
and Adult Brains 

GAL4 expression in the larval brain and CNS of wandering larvae was
analyzed using mCD8::GFP (Lee and Luo, 1999) as reporter. Brain
dissection, immunohistochemistry, and confocal imaging [using a Zeiss
LSM710 and Plan-Apochromat 10× objective (N.A. = 0.45)] were performed
as described previously (Lee and Luo, 1999). For immunohistochemistry,
rabbit anti-GFP (1:1,500, Invitrogen, San Diego, CA) and Alexa 488 goat
anti-rabbit IgG (1:1,000, Invitrogen) were used to enhance the GFP
signal. Serial optical sections (1,024 × 1,024 pixel resolution) were
taken at 2-µm intervals along the dorso-ventral axis. The confocal image
stacks were analyzed using ImageJ software (NIH, Bethesda, MD).

For each genotype, at least three adult flies of mixed genders were
collected 3 to 5 days after eclosion and used for immunohistochemistry
and imaging. Brain dissection, immunohistochemistry, and confocal
imaging [using a Zeiss LSM510 META and Plan-Neofluar 40× objective
(N.A. = 1.3)] were performed as described previously (Gao et al., 2008).
For immunohistochemistry, rabbit anti-GFP (1:300, Torrey Pines Biolabs,
East Orange, NJ) and Alexa 488 goat anti-rabbit IgG (1:250, Molecular
Probes, Eugene, OR) were used to enhance the GFP signal. Serial optical
sections (512 × 512 pixel resolution) were taken at 1-µm intervals along
the rostro-caudal axis. The confocal image stacks were analyzed using
Imaris (Bitplane, Zurich, Switzerland) software.